Welcome to my Data Science project on CHURN PREDICTION for bank customers.
A real dataset containing demographic and behavioral features for 10,000 bank customers is explored, processed, and used to train a Machine Learning model that predicts churn probability. Findings from data exploration and model interpretation are turned into actionable information the bank can use to reduce churn. An optimized probability cutoff is then estimated for the classification model to minimize the bank's costs. The following sections walk you through all of the above.
I also invite you to visit my LinkedIn profile and to see my other projects on my GitHub profile.
Sincerely,
Michail Mavrogiannis
The dataset used as input in the present project was obtained from the "Churn for Bank Customers" post by user Mehmet A. on kaggle.com, where it was published under the "CC0: Public Domain" license. The dataset column names and order have been modified for easier reference.
Import libraries below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
Import dataset into dataframe "dfr".
dfr = pd.read_csv('C:/Users/Michael/Desktop/Data Science- MM/Churn/Data.csv')
dfr.head()
Description of Features:
Column 'surname' will be omitted since it does not provide information useful to predicting customer churn probability:
dfr.drop('surname', axis = 1, inplace = True)
dfr.info()
dfr.drop('customer_id', axis = 1).describe()
The dataset does not contain missing values, as shown above.
In the general case where a dataset does contain missing data, both the frequency (how many rows/columns are affected) and the missingness mechanism (missing completely at random, missing at random, missing not at random) should be examined. Depending on the case, null data can be dropped, imputed with summary statistics (e.g. mean, mode), or estimated with a Machine Learning model.
In the following sections all features will be explored, and any unreasonable values or outliers will be identified.
This section presents an exploratory analysis of the response variable (churn) and the customers' characteristic and behavioral features. The distribution of each feature as well as the variance of the churn probability per the feature values are obtained and evaluated through analytics, graphs, and decision trees as required. Preliminary conclusions are reached about the relationship between the features and the churn probability.
dfr['churn'].value_counts(normalize = True)
The overall churn rate is approximately 20%, which is within the expected range for retail banks.
sns.countplot(data = dfr, x = 'churn')
plt.title('Churn Histogram'); plt.ylabel('# of customers'); plt.show()
dfr['customer_id'].nunique(), len(dfr)
It is confirmed that the dataset contains unique customers only.
This section explores the country-of-residence feature of the bank customers and provides a per-country summary of the total number of customers, their share of the entire dataset, the number of customers who remained with or left the bank, and the churn probability. This information is provided in the dataframe below.
In the following sections, for the remaining features, the summary dataframe is omitted and the same information is conveyed visually.
country = pd.DataFrame()
country['Customers'] = dfr.groupby('country').count()['customer_id']
country['Customer Portion'] = (country['Customers'] / len(dfr)).round(2)
country['Remained'] = dfr[dfr['churn'] == 0].groupby('country').count()['customer_id']
country['Churned'] = dfr[dfr['churn'] == 1].groupby('country').count()['customer_id']
country['Churn Probability'] = dfr.groupby('country')['churn'].mean().round(2)
country
graph = sns.countplot(data = dfr, x = 'country', color = 'lightblue', order = ['France', 'Germany', 'Spain'])
plt.title('Country Histogram'); plt.ylabel('# of customers'); plt.show()
sns.countplot(data = dfr, x = 'country', hue = 'churn', order = ['France', 'Germany', 'Spain'])
plt.title('Churn Histogram per Country'); plt.ylabel('# of customers'); plt.show()
sns.barplot(data = dfr, x = 'country', y = 'churn', estimator = np.mean, ci = None, color = 'lightgreen',
order = ['France', 'Germany', 'Spain'])
plt.ylabel('churn probability'); plt.title('Churn Probability per Country'); plt.show()
It is observed that:
dfr['age'].min(), dfr['age'].max()
plt.figure(figsize = (18,5))
sns.countplot(data = dfr, x = 'age', color = 'lightblue', order = range(18, 93))
plt.title('Age Histogram'); plt.ylabel('# of customers'); plt.show()
plt.figure(figsize = (18,5))
sns.countplot(data = dfr, x = 'age', hue = 'churn', order = range(18, 93))
plt.title('Churn Histogram per Age'); plt.ylabel('# of customers'); plt.show()
The middle 80% of the age observations fall within the interval:
stats.scoreatpercentile(dfr['age'], 10), stats.scoreatpercentile(dfr['age'], 90)
plt.figure(figsize = (18,5))
sns.barplot(data = dfr, x = 'age', y = 'churn', order = range(18, 93), estimator = np.mean, ci = None, color = 'lightgreen')
# Plot vertical lines to delimit the middle 80% of the observations:
age_list = list(dfr['age'].sort_values().unique())
plt.axvline(x = age_list.index(stats.scoreatpercentile(dfr['age'], 10)), color = 'blue', linestyle = '--',
label = 'Middle 80% \n Limits')
plt.axvline(x = age_list.index(stats.scoreatpercentile(dfr['age'], 90)), color = 'blue', linestyle = '--')
plt.ylabel('churn probability'); plt.title('Churn Probability per Age'); plt.legend(); plt.show()
It is observed that:
sns.countplot(data = dfr, x = 'gender', color = 'lightblue')
plt.title('Gender Histogram'); plt.ylabel('# of customers'); plt.show()
sns.countplot(data = dfr, x = 'gender', hue = 'churn')
plt.title('Churn Histogram per Gender'); plt.ylabel('# of customers'); plt.show()
sns.barplot(data = dfr, x = 'gender', y = 'churn', color = 'lightgreen', ci = None)
plt.ylabel('churn probability'); plt.title('Churn Probability per Gender'); plt.show()
It is observed that:
Balances are categorized into bins of width 10,000. Based on the customer countries, Euro (€) currency is assumed.
balance_bins = np.linspace(-0.1, 260000, 27)
balance_labels = [f"[{j}k, {j+10}k)" for j in range(0, 260, 10)]
dfr['balance_bin'] = pd.cut(x = dfr['balance'], bins = balance_bins, labels = balance_labels)
dfr[['balance','balance_bin']].head(3)
plt.figure(figsize = (14,5))
sns.countplot(data = dfr, x = 'balance_bin', color = 'lightblue')
plt.title('Balance Histogram'); plt.xlabel('balance [€]'); plt.xticks(rotation = 70); plt.ylabel('# of customers')
plt.show()
len(dfr[(dfr['balance'] > 0) & (dfr['balance'] <= 10000)])
The above histogram of balances does not have the expected form of a highly right-skewed distribution or even a distribution close to a half-normal.
In addition, almost one third of the observations have a zero balance. This is investigated further in the next section.
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'balance_bin', hue = 'churn')
plt.title('Churn Histogram per Balance'); plt.xlabel('balance [€]'); plt.xticks(rotation = 70); plt.ylabel('# of customers')
plt.legend(loc = ('upper right'), title = 'churn'); plt.show()
Excluding the zero-balance observations, the middle 80% of the remaining observations fall within the interval:
print(stats.scoreatpercentile(dfr[dfr['balance'] != 0]['balance'], 10).round(0),
stats.scoreatpercentile( dfr[dfr['balance'] != 0]['balance'], 90).round(0))
plt.figure(figsize = (14,5))
sns.barplot(data = dfr, x = 'balance_bin', y = 'churn', estimator = np.mean, color = 'lightgreen', ci = None)
# Plot vertical lines to delimit the middle 80% of the observations excluding zero balances:
balance_list = list(dfr['balance_bin'].sort_values().unique())
plt.axvline(x = balance_list.index(pd.cut(x = [stats.scoreatpercentile(dfr[dfr['balance'] != 0]['balance'], 10)],
bins = balance_bins, labels = balance_labels)[0]),
color = 'blue', linestyle = '--', label = 'Middle 80% Limits, \n Excluding 0 Balances')
plt.axvline(x = balance_list.index(pd.cut(x = [stats.scoreatpercentile(dfr[dfr['balance'] != 0]['balance'], 90)],
bins = balance_bins, labels = balance_labels)[0]),
color = 'blue', linestyle = '--')
plt.ylabel('churn probability'); plt.xlabel('balance [€]'); plt.xticks(rotation = 70); plt.legend()
plt.title('Churn Probability per Balance'); plt.show()
It is observed that:
len(dfr[dfr['balance'] == 0]), len(dfr[dfr['balance'] == 0]) / len(dfr)
To investigate the zero balances, the balance feature will be further "sliced" by the other features of the dataset.
A decision tree is used to find the features whose importance stands out when determining whether a customer has a zero balance or not. To obtain these features a "min_impurity_decrease" limit is used, determined after trial-and-error. The training set of the model includes the predictors: country (dummies), age, gender (dummies), num_products, tenure, credit_card, active_member, credit_score, salary. The response variable is whether a user has zero balance or not (a new column is added to the dataframe for that). No train/test set split is required here.
dfr0 = pd.concat(
[dfr[['age', 'num_products', 'tenure', 'credit_card', 'active_member', 'credit_score', 'salary']],
pd.get_dummies(dfr[['gender', 'country']])],
axis = 1)
dfr['zero_balance'] = dfr['balance'].apply(lambda x: 1 if x == 0 else 0)
from sklearn.tree import DecisionTreeClassifier, plot_tree
mdl0 = DecisionTreeClassifier(min_impurity_decrease = 0.01)
mdl0.fit(dfr0, dfr['zero_balance'])
plt.figure(figsize=(15,12)); plot_tree(mdl0, max_depth = 20, fontsize = 20); plt.show()
dfr0.columns[[10, 1]]
The features of outstanding importance for zero balance are 'Germany' and 'num_products'.
For Germany:
sns.countplot(data = dfr, x = 'country', hue = 'zero_balance')
plt.title('Distribution of Zero Balances per Country'); plt.ylabel('# of customers');
plt.legend(loc = (1.01, 0.78), title = 'zero_balance'); plt.show()
Surprisingly, almost 50% of the customers in France and Spain have a zero balance, whereas Germany has no zero-balance customers at all.
For number of products:
sns.countplot(data = dfr, x = 'num_products', hue = 'zero_balance')
plt.title('Distribution of Zero Balances per # of Products'); plt.ylabel('# of customers');
plt.legend(loc = (1.01, 0.78), title = 'zero_balance'); plt.show()
Customers with 2 products show a remarkably higher probability of having a zero balance. This could be because their main product is something other than a checking or savings account, so they merely maintain an additional zero-balance account.
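The relationship above can also be quantified with a cross-tabulation; a minimal sketch on a toy stand-in frame (hypothetical values, since the real dfr is loaded from a local file):

```python
import pandas as pd

# Toy stand-in for the dataset (hypothetical values; the real analysis uses dfr):
toy = pd.DataFrame({'num_products': [1, 1, 2, 2, 2, 3],
                    'zero_balance': [1, 0, 1, 1, 0, 0]})

# Share of zero-balance customers within each product count
# (normalize='index' makes each row sum to 1):
shares = pd.crosstab(toy['num_products'], toy['zero_balance'], normalize='index')
```

Applied to dfr, the same call would give the per-product-count zero-balance rates shown in the plot.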
From the above exploration:
sns.countplot(data = dfr, x = 'num_products', color = 'lightblue')
plt.title('Number of Products Histogram'); plt.ylabel('# of customers'); plt.show()
sns.countplot(data = dfr, x = 'num_products', hue = 'churn')
plt.title('Churn Histogram per Number of Products'); plt.ylabel('# of customers');
plt.legend(loc = ('upper right'), title = 'churn'); plt.show()
sns.barplot(data = dfr, x = 'num_products', y = 'churn', estimator = np.mean, ci = None, color = 'lightgreen')
plt.ylabel('churn probability'); plt.title('Churn Probability per Number of Products'); plt.show()
As seen above:
sns.countplot(data = dfr, x = 'tenure', color = 'lightblue')
plt.title('Tenure Histogram'); plt.xlabel('tenure in years'); plt.ylabel('# of customers'); plt.show()
sns.countplot(data = dfr, x = 'tenure', hue = 'churn')
plt.title('Churn Histogram per Tenure'); plt.xlabel('tenure in years'); plt.ylabel('# of customers');
plt.legend(loc = (1.01, 0.78), title = 'churn'); plt.show()
sns.barplot(data = dfr, x = 'tenure', y = 'churn', estimator = np.mean, ci = None, color = 'lightgreen')
plt.xlabel('tenure in years'); plt.ylabel('churn probability'); plt.title('Churn Probability per Tenure'); plt.show()
It is observed that:
sns.countplot(data = dfr, x = 'credit_card', color = 'lightblue')
plt.title('Credit Card Possession Histogram'); plt.ylabel('# of customers'); plt.show()
sns.countplot(data = dfr, x = 'credit_card', hue = 'churn')
plt.title('Churn Histogram per Credit Card Possession'); plt.ylabel('# of customers'); plt.show()
sns.barplot(data = dfr, x = 'credit_card', y = 'churn', color = 'lightgreen', ci = None)
plt.ylabel('churn probability'); plt.title('Churn Probability per Credit Card Possession'); plt.show()
It is observed that:
sns.countplot(data = dfr, x = 'active_member', color = 'lightblue')
plt.title('Degree-of-Activity Histogram'); plt.ylabel('# of customers'); plt.show()
sns.countplot(data = dfr, x = 'active_member', hue = 'churn')
plt.title('Churn Histogram per Degree-of-Activity'); plt.ylabel('# of customers'); plt.show()
sns.barplot(data = dfr, x = 'active_member', y = 'churn', color = 'lightgreen', ci = None)
plt.ylabel('churn probability'); plt.title('Churn Probability per Degree-of-Activity'); plt.show()
It is observed that:
Credit scores are categorized into bins of width 20 points.
score_bins = np.linspace(349.9, 850, 26)
score_labels = [f"[{j}, {j+20})" for j in range(350, 850, 20)]
dfr['score_bin'] = pd.cut(x = dfr['credit_score'], bins = score_bins, labels = score_labels)
dfr[['credit_score','score_bin']].head(3)
plt.figure(figsize = (14,5))
sns.countplot(data = dfr, x = 'score_bin', color = 'lightblue')
plt.title('Credit Score Histogram'); plt.xlabel('credit_score'); plt.xticks(rotation = 70); plt.ylabel('# of customers')
plt.show()
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'score_bin', hue = 'churn')
plt.title('Churn Histogram per Credit Score'); plt.xlabel('credit_score'); plt.xticks(rotation = 70)
plt.ylabel('# of customers'); plt.show()
The middle 80% of the credit_score observations fall within the interval below. The second mode of the distribution at the right extreme involves a negligible number of observations, so the behavior of the middle 80% remains meaningful.
stats.scoreatpercentile(dfr['credit_score'], 10), stats.scoreatpercentile(dfr['credit_score'], 90)
plt.figure(figsize = (14,5))
sns.barplot(data = dfr, x = 'score_bin', y = 'churn', estimator = np.mean, color = 'lightgreen', ci = None)
# Plot vertical lines to delimit the middle 80% of the observations:
score_list = list(dfr['score_bin'].sort_values().unique())
plt.axvline(x =
score_list.index(pd.cut(x= [stats.scoreatpercentile(dfr['credit_score'], 10)], bins= score_bins, labels = score_labels)[0]),
color = 'blue', linestyle = '--', label = 'Middle 80% \n Limits')
plt.axvline(x =
score_list.index(pd.cut(x= [stats.scoreatpercentile(dfr['credit_score'], 90)], bins= score_bins, labels = score_labels)[0]),
color = 'blue', linestyle = '--')
plt.xlabel('credit_score'); plt.ylabel('churn probability'); plt.title('Churn Probability per Credit Score');
plt.xticks(rotation = 70); plt.legend(); plt.show()
It is observed that:
Salaries are categorized into bins of width 10,000 units. Based on the customer countries, Euro (€) currency is assumed.
salary_bins = np.linspace(0, 200000, 21)
salary_labels = [f"[{j}k, {j+10}k)" for j in range(0, 200, 10)]
dfr['salary_bin'] = pd.cut(x = dfr['salary'], bins = salary_bins, labels = salary_labels)
dfr[['salary','salary_bin']].head(3)
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'salary_bin', color = 'lightblue')
plt.title('Salary Histogram'); plt.xticks(rotation = 70); plt.xlabel('salary [€]'); plt.ylabel('# of customers')
plt.show()
plt.figure(figsize = (14, 5))
sns.countplot(data = dfr, x = 'salary_bin', hue = 'churn')
plt.title('Churn Histogram per Salary'); plt.xticks(rotation = 70); plt.xlabel('salary [€]'); plt.ylabel('# of customers')
plt.legend(loc = (1.01, 0.83), title = 'churn'); plt.show()
plt.figure(figsize = (14,5))
sns.barplot(data = dfr, x = 'salary_bin', y = 'churn', estimator = np.mean, color = 'lightgreen', ci = None)
plt.title('Churn Probability per Salary'); plt.xticks(rotation = 70); plt.xlabel('salary [€]');
plt.ylabel('churn probability'); plt.show()
As seen above:
The dataset will be used to train a Machine Learning model for predicting churn probability for the bank customers. It is assumed that the dataset on hand is representative of the bank customers, or of a subset of interest.
A Random Forest model is selected because it handles nonlinear relationships between the predictors (features) and the response variable, works with both numerical and categorical variables, is robust against outliers, and does not require feature normalization. Feature importances and partial dependence plots will be used to interpret the model.
Categorical variables country and gender are converted to dummies. For country, the "drop_first" option is not used, so that all countries appear explicitly in the feature-importance table. The fact that the country dummies are correlated has a negligible effect on the model.
country_dum = pd.get_dummies(dfr['country'])
gender_dum = pd.get_dummies(dfr['gender'], drop_first = True)
dfr.columns
X = pd.concat([dfr[['age', 'balance', 'num_products', 'tenure', 'credit_card', 'active_member', 'credit_score', 'salary']],
country_dum, gender_dum], axis = 1)
X.head(2)
y = dfr['churn']
y.head(2)
Train and test sets are created below:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
plt.figure(figsize = (10,8))
sns.heatmap(X.corr(method = 'pearson').round(2), annot = True)
plt.title('Pearson Correlation of Features'); plt.show()
The features are not highly correlated in general. Regarding the correlation between the country dummies, please refer to the previous section.
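As a complementary check not in the original analysis, multicollinearity can also be quantified with variance inflation factors (VIFs); a minimal NumPy sketch on toy features, using the identity that each VIF equals the corresponding diagonal element of the inverse correlation matrix:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
Xt = pd.DataFrame({'x1': x1,
                   'x2': x1 + rng.normal(scale=0.1, size=500),  # near-duplicate of x1
                   'x3': rng.normal(size=500)})                 # independent noise

# Each feature's VIF is the matching diagonal element of the
# inverse of the feature correlation matrix:
vif = pd.Series(np.diag(np.linalg.inv(Xt.corr().values)), index=Xt.columns)
```

A VIF well above roughly 5-10 flags a feature largely explained by the others; for dummy-coded countries without drop_first, elevated VIFs are expected by construction.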
The Random Forest hyperparameters selected for tuning are: number of decision trees (to reduce variance), maximum depth of the trees (to avoid overfitting), and class weight (to penalize misclassified minority-class examples, since the classes are unbalanced). A relatively limited tuning is performed here for the sake of simplicity; in practice, tuning would cover more hyperparameters and more candidate values for each.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator = RandomForestClassifier(),
param_grid = {'n_estimators' : [50, 200, 400],
'max_depth': [4, 6, 8],
'class_weight': [{0: 1, 1: 0.8}, {0: 1, 1: 1.0}, {0: 1, 1: 1.2}, {0: 1, 1: 1.5}]},
verbose = 1, n_jobs = -1)
grid.fit(X_train, y_train)
grid.best_params_, grid.best_score_.round(3)
The "grid.best_score_" found above is the mean cross-validated score (accuracy) of the model with the optimal hyperparameters.
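For reference, the per-combination scores behind "best_score_" can be inspected via "cv_results_"; a sketch on toy data (make_classification stands in for the locally loaded churn data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the churn training set:
Xt, yt = make_classification(n_samples=300, n_features=5, random_state=1)

g = GridSearchCV(RandomForestClassifier(random_state=1),
                 param_grid={'n_estimators': [20, 50], 'max_depth': [3, 5]})
g.fit(Xt, yt)

# "cv_results_" stores the mean cross-validated score of every
# hyperparameter combination; "best_score_" is the maximum of these:
res = pd.DataFrame(g.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
```

Sorting res by 'rank_test_score' shows how close the runner-up combinations are to the selected one.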
A Random Forest Classifier is now trained based on the hyperparameters selected above.
from sklearn.ensemble import RandomForestClassifier
mdl = RandomForestClassifier(n_estimators = grid.best_params_['n_estimators'],
max_depth = grid.best_params_['max_depth'],
class_weight = grid.best_params_['class_weight'], oob_score = True, n_jobs = -1)
mdl.fit(X_train, y_train)
y_pred = mdl.predict(X_test)
y_prob = mdl.predict_proba(X_test)
from sklearn.metrics import classification_report, confusion_matrix, \
f1_score, precision_score, recall_score, accuracy_score
c_m = pd.DataFrame(data = confusion_matrix(y_test, y_pred)); c_m.index.names = ['Actual']; c_m.columns.names = ['Predicted']
print(classification_report(y_test, y_pred)); c_m
print('The out-of-bag accuracy is {}.'.format(mdl.oob_score_.round(3)))
print('The training set accuracy is {}.'.format(accuracy_score(y_train, mdl.predict(X_train)).round(3)))
print('The test set accuracy is {}.'.format(accuracy_score(y_test, y_pred).round(3)))
It is observed that:
ftrs = pd.DataFrame(data = mdl.feature_importances_, index = X.columns, columns = ['importance'])
ftrs.sort_values('importance', ascending = False, inplace = True)
plt.figure(figsize = (14,6))
sns.barplot(data = ftrs, y = ftrs.index, x = 'importance', color = 'lightgreen')
plt.xlabel('Importance'); plt.ylabel('Model Features'); plt.title('Feature Importances'); plt.show()
In general the feature importances seem to agree with the conclusions of the data exploration for each feature; specifically:
Partial dependence plots for the most important features are presented below, to visualize the marginal effect of each of these features on the response variable. The vertical lines on the x-axes of the plots indicate the deciles of the feature values.
from sklearn.inspection import partial_dependence, plot_partial_dependence
plot_partial_dependence(estimator = mdl, X = X_train, features = ['age'], percentiles = (0,1))
plt.title('age: PDP'); plt.ylim([0, 0.55]); plt.show()
plot_partial_dependence(estimator = mdl, X = X_train, features = ['num_products'], percentiles = (0,1))
plt.title('num_products: PDP'); plt.ylim([0, 0.8]); plt.show()
plot_partial_dependence(estimator = mdl, X = X_train, features = ['balance'], percentiles = (0,1))
plt.title('balance: PDP'); plt.ylim([0, 0.40]); plt.show()
plot_partial_dependence(estimator = mdl, X = X_train, features = ['active_member'], percentiles = (0,1))
plt.title('active_member: PDP'); plt.ylim([0, 0.35]); plt.show()
plot_partial_dependence(estimator = mdl, X = X_train, features = ['Germany'], percentiles = (0,1))
plt.title('Germany: PDP'); plt.ylim([0, 0.35]); plt.show()
plot_partial_dependence(estimator = mdl, X = X_train, features = ['Male'], percentiles = (0,1))
plt.title('gender: PDP'); plt.ylim([0, 0.35]); plt.show()
Partial dependence plots for the least important features are presented below.
plot_partial_dependence(estimator = mdl, X = X_train, features = ['credit_score'], percentiles = (0,1))
plt.title('credit_score: PDP'); plt.ylim([0, 0.5]); plt.show()
The peak of partial dependence for credit scores < 400 corresponds to a negligibly small portion of customers.
plot_partial_dependence(estimator = mdl, X = X_train, features = ['salary'], percentiles = (0,1))
plt.title('salary: PDP'); plt.ylim([0, 0.5]); plt.show()
plot_partial_dependence(estimator = mdl, X = X_train, features = ['tenure'], percentiles = (0,1))
plt.title('tenure: PDP'); plt.ylim([0, 0.5]); plt.show()
plot_partial_dependence(estimator = mdl, X = X_train, features = ['credit_card'], percentiles = (0,1))
plt.title('credit_card: PDP'); plt.ylim([0, 0.5]); plt.show()
No two-feature interaction plots are needed, because the features are generally uncorrelated, as seen previously.
By default, the cutoff probability for an example to be classified as label "1" is 50%. As seen in the Model Training section, the test set yields a certain number of False Negatives (FN) and False Positives (FP), between which there is a trade-off.
A FN means that a customer is predicted as unlikely to leave but eventually leaves. From a business standpoint, churned customers carry a high cost for the bank, including the cost of acquiring new customers, the increased customer-support load until new customers become familiar with the products, etc. FNs should therefore be minimized but not eliminated, since eliminating them would inflate the FPs. Although lower than the FN cost, FPs also carry a cost: targeting the customers predicted as likely to leave with ads and marketing promotions to prevent them from leaving. A cutoff-probability optimization is performed below to minimize the bank's expected cost per customer, assuming the hypothetical values:
# Find new cutoff for minimum expected cost per customer:
proba = pd.DataFrame(data = mdl.predict_proba(X_test), columns = ['prob_0', 'prob_1'])
cost_fn = 350 ; cost_fp = 100
expected_cost = [] ; f1 = []
for i in np.linspace(0, 100, 101):
    proba['new_class'] = proba['prob_1'].apply(lambda x: 1 if x >= i/100 else 0)
    c_m = confusion_matrix(y_test, proba['new_class'])
    expected_cost.append(c_m[0][1] / c_m.sum() * cost_fp + c_m[1][0] / c_m.sum() * cost_fn)
    f1.append(f1_score(y_test, proba['new_class']))
cut_off = np.argmin(expected_cost) / 100
# Plot expected cost, f1 score:
plt.figure(figsize = (10, 5))
plt.subplot(1,2,1)
plt.plot(np.linspace(0, 1, 101), expected_cost)
plt.title('Expected Cost Optimization'); plt.xlabel('Cutoff probability for label "1"')
plt.ylabel('Expected cost per customer')
plt.subplot(1,2,2)
plt.plot(np.linspace(0, 1, 101), f1, label = 'f1 score', color = 'orange')
plt.title('f1-Score'); plt.xlabel('cutoff probability for label "1"'); plt.ylabel('f1-Score')
plt.show()
# Print cutoff probability, min expected cost, confusion matrix:
print('The optimized Cutoff Probability for Class "1" is {}, \nand the minimum Expected Cost is {} € per customer.'\
.format(cut_off, min(expected_cost).round(1)))
c_m = pd.DataFrame(data = confusion_matrix(y_test, proba['prob_1'].apply(lambda x: 1 if x >= cut_off else 0)))
c_m.index.names = ['Actual']; c_m.columns.names = ['Predicted w/ New Cutoff']
c_m
As expected given the assumed ratio of the FP to FN cost per customer, the optimized cutoff probability moved examples from the FN category to the FP category.
To minimize the bank's overall expected cost per customer, an optimal cutoff probability for the classifier was found previously. This new cutoff point was based on assumed hypothetical values for the bank's average cost per False Positive (FP) and per False Negative (FN) case.
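In deployment, this optimized cutoff would replace the default 0.5 threshold of "predict"; a minimal sketch of the thresholding step (the 0.35 value below is illustrative, not the optimized cutoff printed above):

```python
import numpy as np

def classify_with_cutoff(prob_1, cutoff):
    """Assign label "1" (likely to churn) when the predicted probability
    reaches the cutoff, instead of the default 0.5 threshold."""
    return (np.asarray(prob_1) >= cutoff).astype(int)

# Hypothetical predicted probabilities and an illustrative cutoff of 0.35:
labels = classify_with_cutoff([0.10, 0.34, 0.35, 0.80], 0.35)
```

In practice, prob_1 would be the second column of mdl.predict_proba(...) and cutoff the optimized value.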
-------------------- / END OF NOTEBOOK, THANK YOU! / -------------------- © 2021 Michail Mavrogiannis
You are welcome to visit my LinkedIn profile and to see my other projects on my GitHub profile.
Michail Mavrogiannis